Statistical significance of variables driving systematic variation in high-dimensional data
Abstract
MOTIVATION: There are a number of well-established methods such as principal component analysis (PCA) for automatically capturing systematic variation due to latent variables in large-scale genomic data. PCA and related methods may directly provide a quantitative characterization of a complex biological variable that is otherwise difficult to precisely define or model. An unsolved problem in this context is how to systematically identify the genomic variables that are drivers of systematic variation captured by PCA. Principal components (PCs) (and other estimates of systematic variation) are directly constructed from the genomic variables themselves, making measures of statistical significance artificially inflated when using conventional methods due to over-fitting.
RESULTS: We introduce a new approach called the jackstraw that allows one to accurately identify genomic variables that are statistically significantly associated with any subset or linear combination of PCs. The proposed method can greatly simplify complex significance testing problems encountered in genomics and can be used to identify the genomic variables significantly associated with latent variables. Using simulation, we demonstrate that our method attains accurate measures of statistical significance over a range of relevant scenarios. We consider yeast cell-cycle gene expression data, and show that the proposed method can be used to straightforwardly identify genes that are cell-cycle regulated with an accurate measure of statistical significance. We also analyze gene expression data from post-trauma patients, allowing the gene expression data to provide a molecularly driven phenotype. Using our method, we find a greater enrichment for inflammatory-related gene sets compared to the original analysis that uses a clinically defined, although likely imprecise, phenotype. The proposed method provides a useful bridge between large-scale quantifications of systematic variation and gene-level significance analyses.
AVAILABILITY AND IMPLEMENTATION: An R software package, called jackstraw, is available in CRAN.
CONTACT: [email protected].
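The over-fitting problem described in the abstract, and a permutation-based remedy in the spirit of the jackstraw, can be illustrated with a small sketch. The abstract does not spell out the algorithm, so the following is only one plausible reading, written in Python with NumPy rather than the R package: a few rows are permuted to create synthetic null variables, the PCs are re-estimated from the modified matrix, and the association statistics of those permuted rows form an empirical null distribution. The function names (`top_pcs`, `f_stats`, `jackstraw_sketch`), the parameters (`r`, `s`, `B`) and the simulated data are all hypothetical and do not reflect the API of the jackstraw package.

```python
import numpy as np


def top_pcs(Y, r):
    """Top-r sample-space PCs (right singular vectors) of the row-centered matrix."""
    Yc = Y - Y.mean(axis=1, keepdims=True)
    _, _, vt = np.linalg.svd(Yc, full_matrices=False)
    return vt[:r].T  # shape (n_samples, r)


def f_stats(Y, pcs):
    """Per-row F-statistic: intercept-only model versus intercept plus PCs."""
    n, r = pcs.shape
    X = np.column_stack([np.ones(n), pcs])
    rss0 = ((Y - Y.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    beta, *_ = np.linalg.lstsq(X, Y.T, rcond=None)
    rss1 = ((Y.T - X @ beta) ** 2).sum(axis=0)
    return ((rss0 - rss1) / r) / (rss1 / (n - r - 1))


def jackstraw_sketch(Y, r=1, s=10, B=100, seed=0):
    """Empirical p-values for association of each row with the top-r PCs (assumed scheme)."""
    rng = np.random.default_rng(seed)
    m, _ = Y.shape
    observed = f_stats(Y, top_pcs(Y, r))
    null = []
    for _ in range(B):
        rows = rng.choice(m, size=s, replace=False)
        Yb = Y.copy()
        for i in rows:
            # Permuting a row breaks its link to any latent variable.
            Yb[i] = rng.permutation(Y[i])
        # PCs are re-estimated with the permuted rows included, so the null
        # statistics carry the same over-fitting as the observed ones.
        null.append(f_stats(Yb[rows], top_pcs(Yb, r)))
    null = np.sort(np.concatenate(null))
    n_ge = null.size - np.searchsorted(null, observed, side="left")
    return (n_ge + 1) / (null.size + 1)


# Simulated example (hypothetical data): the first 20 of 200 variables
# are driven by a single latent variable z; the rest are pure noise.
rng = np.random.default_rng(1)
m, n = 200, 30
z = rng.normal(size=n)
Y = rng.normal(size=(m, n))
Y[:20] += 3.0 * z
pvals = jackstraw_sketch(Y, r=1, s=10, B=50, seed=0)
```

In this simulation the rows driven by `z` should receive small p-values while the noise rows should not; keeping `s` small relative to the number of variables is meant to leave the estimated PCs nearly unchanged by the permutations.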
Journal title:
Volume 31, Issue
Pages -
Publication date: 2015